Week 9 - Discussion Questions

These are example discussion points for you to think about before class. You are not expected to engage with all of them — pick the ones that speak most directly to your own research, and bring two or three rough answers to the in-class session. The full description of how to use these pages, including what the question tags mean, is on the Week 1 Discussion page.

Sub-lessons

The Trajectory of LLM Capabilities

Calibrate The May 2026 frontier snapshot is explicitly dated and will be wrong within months. Pick the single most recent model release in your area and place its capabilities against the snapshot. Where does the picture still hold, and where has it already moved?
Apply For your own work, which family of models — closed frontier, open-weights, specialist — is the right default? When would you swap defaults?
Critical The benchmark numbers cited (GPQA, SWE-bench, FrontierMath, MCP-Atlas) all show extraordinary gains. The lesson treats them as carefully bounded. Where, honestly, do you find yourself reading them as more authoritative than the lesson means them to be?
Connect Week 1's Current Generative AI Landscape page offered an earlier snapshot of roughly the same picture. Compare the two snapshots side by side. Which of the changes between Week 1 and Week 9 are quantitative (more capability of the same kind) and which are qualitatively new? What does the answer say about which frontier-trajectory claims are best treated as “real and accelerating” versus “real but uneven”?

Where AI Is Now Genuinely Strong

Calibrate Pick the strongest claim in this sub-lesson about a current AI capability. Find one paper or demonstration that supports it cleanly, and one that complicates it. What in the second source complicates the claim most?
Apply Identify a concrete task in your own work where the lesson's “now genuinely strong” framing matches your direct experience, and one where the framing oversells what you have seen. Be specific.
Critical “Genuinely strong at X” tends to be argued from impressive demonstrations. What would the same case look like argued from a representative sample of failed attempts?
Connect Weeks 5, 6, 7, and 8 each made specific claims about where AI is genuinely useful (literature review, drafting, data analysis, multimodal). Which of those week-specific claims now feel solid under this lesson's “genuinely strong” bar, and which feel weaker once you place them next to it?

Three Categories of Failure

Calibrate Take a recent failure of an AI tool you have personally encountered. Place it in the patched / reduced-but-persistent / structural taxonomy. What evidence would move it between categories?
Apply For your own research workflow, which category of failure is the most operationally costly, and therefore the one most worth your verification effort?
Critical The structural category implies some failures will not be solved by scale. Steel-man the opposite view: where might today's structural-looking failures turn out to be next year's patched ones?
Connect The harness-engineering result (69.7%→77.0% on the same model) suggests that some apparent failures are not in the model at all. Read this against the “patched” / “reduced-but-persistent” / “structural” categories. Does it shift the categorisation, or sit outside it?

Illusions of Understanding

Calibrate The lesson describes specific failure modes (hallucination, sycophancy, benchmark contamination) that produce illusions of understanding. Pick one example from a recent paper or model output where you initially mistook fluency for understanding. What was the actual tell that broke the illusion?
Apply Design a personal “sycophancy check” you would apply to your AI sessions. Make it specific enough that you would actually use it on a real task.
Critical Benchmark contamination is a deep, persistent problem. Is the right response to (a) build harder, fresher benchmarks faster than they can be polluted, (b) abandon benchmarks for capability claims, or (c) use them only as upper bounds? Pick one.
Connect Week 5 covered citation hallucination as a specific failure; Week 7 covered silent errors in data analysis; Week 8 covered the “correct answer, wrong reasoning” problem in scientific image analysis. Place all three of those concrete failure stories inside the “illusions of understanding” frame here. Do they all share a single underlying mechanism, or does the frame paper over genuinely different failure modes?

Verification Protocols for a Moving Target

Calibrate Apply the lesson's verification protocol to one piece of AI output you have used in real research in the past two weeks. Where in the protocol did you actually stop, and where did you skip steps you would now go back and do?
Apply Draft your minimal personal verification protocol — the version you would not skip even under deadline pressure. What is on it, and what did you have to leave off?
Critical Verification adds friction. When does that friction stop being protective and start being avoidant procrastination?
Connect Weeks 5, 6, 7, and 8 each gave you a domain-specific verification protocol (citations; writing audit; code review; multimodal output check). Take the verification protocol from this lesson and ask: does it sit above those four (as the principle they each instantiate), or is it just a fifth domain-specific protocol that happens to be more general?

Hands-On Activities and Assessment

Calibrate Apply the worked exercises to a real, current research task — not a contrived example. Where does the exercise design itself reveal something about your habits, regardless of what the AI does?
Apply The Week 9 assessment is “an explicitly dated snapshot.” Sketch the metadata you would want on every AI-assisted artefact you produce in research so that “dated snapshot” becomes your default mode.
Critical The free-tools-only constraint is presented as an equity move. Is the trade-off between equity (everyone on the same footing) and capability (some students could get further with paid tools) being struck in the right place? What would you change?
Connect Compare the hands-on activities here with the equivalent activities at the end of Weeks 5, 6, 7, and 8. The earlier weeks asked you to verify in a specific domain; this week asks you to verify the verification itself. Which earlier hands-on session most directly prepared you for this one, and which one do you now wish you had taken more seriously at the time?
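
As a starting point for the Apply question in this section, here is one possible shape for a "dated snapshot" metadata record. It is a minimal sketch, not a format prescribed by the course: the class name SnapshotMetadata, every field name, and the example values (including the placeholder model name) are assumptions made for illustration. Treat it as something to argue with and revise.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json


@dataclass
class SnapshotMetadata:
    """Hypothetical provenance record for one AI-assisted artefact."""
    artefact: str        # what was produced, e.g. a file name
    model: str           # model name and version as reported by the tool
    accessed_on: date    # the date that makes the snapshot "dated"
    task: str            # what the model was asked to do
    verified_by: str     # who checked the output
    checks_done: list[str] = field(default_factory=list)  # verification steps completed
    known_gaps: list[str] = field(default_factory=list)   # steps skipped or still open

    def to_json(self) -> str:
        """Serialise the record, writing the date in ISO 8601 form."""
        record = asdict(self)
        record["accessed_on"] = self.accessed_on.isoformat()
        return json.dumps(record, indent=2)

    def age_in_days(self, today: date) -> int:
        """How old the snapshot is on the given day."""
        return (today - self.accessed_on).days


# Illustrative values only; none of these come from the course materials.
example = SnapshotMetadata(
    artefact="lit_review_notes.md",
    model="example-model-v1 (placeholder)",
    accessed_on=date(2026, 5, 1),
    task="summarise ten abstracts",
    verified_by="author",
    checks_done=["citations checked against source PDFs"],
)
print(example.to_json())
print(example.age_in_days(date(2026, 6, 1)))
```

Keeping known_gaps as its own field is a deliberate choice in this sketch: it records what was not verified, which is often the part a later reader most needs to see.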